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Abstract: In this paper, we propose a semiparametric approach, called the nonpara- 

normal skeptic, for efRciently and robustly estimating high dimensional undirected 
graphical models. To achieve modeling flexibility, we consider Gaussian Copula graph- 
ical models (or the nonparanormal) as proposed by Liu et al. (2009). To achieve es- 
timation robustness, we exploit nonparametric rank-based correlation coefficient esti- 
mators, including Spearman's rho and Kendall's tan. In high dimensional settings, we 
prove that the nonparanormal skeptic achieves the optimal parametric rates of con- 
vergence for both graph and parameter estimation. This result suggests that Gaussian 
copula graphical models can be used as a safe replacement of the popular Gaussian 
graphical models, even when the data are truly Gaussian. Besides theoretical analy- 
sis, we also conduct thorough numerical simulations to compare the graph recovery 
performance of different estimators under both ideal and noisy settings. The proposed 
methods are then applied on a large-scale genomic datasct to illustrate their empirical 
usefulness. The R package huge implementing the proposed methods is available on the 
Comprehensive R Archive Network: http://cran.r-project.org/. 
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1. Introduction 

Undirected graphical models provide a powerful framework to explore the interrelationship 
among a large number of random variables. They have found routine use in analyzing complex 
and high dimensional data from high throughput genomic experiments, functional Magnetic 
Resonance Imaging (fMRI), and mass spectrometry analysis. 

An undirected graph G = {V,E) consists of a set of vertices V = {1, . . . ,d} and a set 
of unordered pairs E representing edges between the vertices. Each vertex i corresponds to 
a random variable X^. A joint distribution P for the random vector X — {Xi, . . . ,Xii)'^ 
is Markov to G — (V, E) if the following condition holds: X^ is independent of Xj given 
{Xk : k i, j) if and only if (i, j) ^ E.lna graph estimation problem, we have n observations 
of the random vector X, and want to estimate the edge set E. 

The most common method for estimating E when the dimension d is small is to assume 
that Y has a multivariate Gaussian distribution with inverse covariance matrix (also called 
the concentration matrix or the precision matrix) Q, then test the sparsity pattern of Q 
(Dempster, 1972; Edwards, 1995). Drton and Perlman (2007, 2008) have developed this 
method in detail. A drawback is that the dimensionality d must be strictly smaller than n. 

In the high dimensional setting when d > n, Meinshausen and Buhlmann (2006) propose 
a parallel regression based approach to estimate graphs under Gaussian models. They use 
the lasso (Chen et al., 1998; Tibshirani, 1996) to regress X^ on {Xj : j ^ i) in parallel for 
each Xi. Let 



with /3* := (/3|, . . . ,/3^)"^ and define the neighborhood of i by Ni := {j : fij ^ 0}. The lasso 
gives estimates /3* for all i. Let Ni := {j : /?] 7^ 0} and let E be the set of edges such 
that i e Nj and j ^ Ni. Under suitable sparsity assumptions they prove that P(A^i = N^) 1 




(1.1) 
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as n — >■ oo even if d — rf for some 7 > 0. Similarly, F{E = £■)—>■ 1 as n —)■ 00. For discrete 
data, Ravikumar et al. (2010) propose a similar procedure for estimating high dimensional 
Ising models or discrete Markov random fields. The main difference is that they replace the 
lasso with £i-regularized logistic regression. 

Alternatively, Yuan and Lin (2007) propose a penalized likelihood estimator 

Q — arg max <^ loglikelihood(n) - X\\Qjk\\ (1-2) 

where the loghkelihood of Q is evaluated under the Gaussian model, with Q as precision 
matrix. The estimator Q can be efficiently computed using the glasso algorithm (Banerjee 
et al., 2008; Friedman et al., 2008), which is a block coordinate descent procedure that uses 
the standard lasso to estimate a single row and column of Q in each iteration. The resulting 
estimator Q has been shown to have good theoretical properties (Lam and Fan, 2009; Peng 
et al., 2009; Ravikumar et al., 2009; Rothman et al., 2008). 

More recently a new method called the graphical Dantzig selector (or gDantzig) was in- 
troduced, which estimates the precision matrix column by column using the Dantzig selector 
(Yuan, 2010). In terms of £i-risk, the graphical Dantzig selector is minimax optimal over 
certain model class. Another graph and precision matrix estimation method called CLIME 
is developed by Cai et al. (2011). The rate of convergence and optimahty properties of the 
CLIME have also been estabhshed under mild conditions. 

Despite the popularity of the Gaussian graphical model, its normality assumption is rather 
restrictive. If this parametric normality assumption is correct, accurate and precise estimates 
can be expected. However, given the increasing complexity of modern datasets, conclusions 
inferred under such a restrictive assumption could be misleading. In fact, besides high di- 
mensionality, modern scientific datasets pose two additional challenges: 

• The distributions of the data are in general non-Gaussian; 

• The data could be noisy (e.g. contaminated by outliers). 

To handle the first challenge, Liu et al. (2009) propose the nonparanormal, a semiparametric 
Gaussian copula (Klaassen and Wellncr, 1997; Tsukahara, 2005), which relaxes the Gaussian 
assumption. A random vector X belongs to a nonparanormal family if there exists a set of 
univariate monotone functions {fj}j=i such that f{X) := (/i(Xi), . . . , fd{Xdj)^ is Gaussian. 
They provided a learning algorithm that has the same computational cost as the glasso. They 
first use a Winsorized estimator to estimate the marginal transformations fj, then estimate 
the precision matrix using the transformed data. A rate of convergence IS 
established for estimating the precision matrix in terms of Frobenius and spectral norms. 
However, it was not clear if their obtained rate of convergence is optimal. 

In this paper we show that the rate of convergence obtained by Liu et al. (2009) is not opti- 
mal. We present an alternative procedure that simultaneously achieves estimation robustness 
and rate optimality. The main idea is to exploit robust nonparametric rank-based statistics 
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including Spearman's rho and Kendall's tau to directly estimate the unknown correlation ma- 
trix, without explicitly calculating the marginal transformations. We call this approach the 
nonparanormal skeptic (since the Spearman/Kendall estimates preempt transformations 
to infer correlation). The estimated correlation matrix is then plugged into existing para- 
metric procedures (graphical lasso, CLIME, or graphical Dantzig selector) to obtain the final 
estimate of the inverse correlation matrix and the graph. 

By leveraging existing analysis of different parametric methods (Cai et al., 2011; Lam and 
Fan, 2009; Ravikumar et al., 2009; Yuan, 2010), we prove that although the nonparanormal 
is a strictly larger family of distributions than the Gaussian, the nonparanormal skeptic 
achieves the optimal parametric rate in terms of both precision matrix estimation and graph 
recovery. This result suggests that the extra modeling flexibility and robustness come at 
almost no cost of statistical efficiency. Thus, this new estimator can be used as a safe re- 
placement for Gaussian estimators even when the data arc truly Gaussian. Moreover, by 
avoiding the estimation of the transformation functions, this new approach has fewer tuning 
parameters than the nonparanormal estimator proposed by Liu et al. (2009). 

We provide careful numerical studies to support our theory. Our results show that, when 
the data contamination rate is low, the normal-score based nonparanormal estimator pro- 
posed by Liu et al. (2009) is slightly more efficient than the nonparanormal skeptic. How- 
ever, when the data contamination rate is higher, the nonparanormal skeptic clearly out- 
performs the normal-score based estimators. This result reflects an interesting tradeoff of 
statistical efficiency with estimation robustness. 

The remainder of the paper is organized as follows. In the next section we briefly re- 
view some background of the nonparanormal estimator from Liu et al. (2009). In Section 3 
we present the nonparanormal SKEPTIC estimator, which exploits the Spearman's rho and 
Kendall's tau statistics to estimate the underlying correlation matrix. We present a theoret- 
ical analysis of the method in Section 4, with more detailed proofs collected in the appendix. 
In Section 5 we present numerical results on both simulated and real data, where the prob- 
lem is to construct large undirected graphs for different biological entities (different tissue 
types or genes) using very large-scale genomic datasets. We then discuss the connections to 
existing methods and possible future directions in the last section. 



2. Background 

In this section we briefly describe the nonparanormal family and the normal-score based 
graph estimator proposed by Liu et al. (2009). 

Let A = [Ajk] e K'^'^'^ and v = (vi, . . . , va)^ e M.'^. For 1 < g < oo, we define 

/ d \ v« 
lli'llq = I \vi\'' j and ||t'||oo = '^^^ I'^il- 
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For 1 < g < oo, we define the matrix £q-operator norms: 

sup 



For q — 1 and q — oo, the matrix norm can be more explicitly represented as 

d d 

\\A\\i^ma^^J2\Aij\ and ||A||oo=m^^^ |Aj|■ 
-'- i=l 3=1 

The matrix ^2-operator norm is the leading singular value and is often called the spectral 
norm. We also define ||A||inax — niaxj^fe \Aj}. I and ll^llln = Y.j,k \^3k?- We denote 



and similarly denote by the submatrix of A obtained by removing the i^^ row and j^^ 

column. We use to represent the i^^ row of A with its j^^ entry removed. The notations 
Amin(^) and Aniax(^) are used for the smallest and largest singular values of A. 



2.1. The Nonparanormal 

The general form for a strictly positive probability density encoded by an undirected graph 
G is 

P(^) = ^exp[ Yl fci^c)], (2.1) 

^"^^ \CeCliques(G) / 

where the sum is over all cliques, or fully connected subsets of vertices of the graph. In general, 
this is what we mean by a nonparametric graphical model. It is the graphical model analogue 
of the general nonparametric regression model. Model (2.1) has two main ingredients, the 
graph G and the functions {fc}- However, without further assumptions, it is much too 
general to be practical. The main difficulty in working with such a model is the normalizing 
constant Z{f), which cannot, in general, be efficiently computed or approximated. 

The nonparanormal distribution can be viewed as a subclass of nonparametric graphical 
models with more restrictive constraints. As has been discussed in Liu et al. (2009), the 
nonparanormal approach parallels the ideas behind sparse additive models for regression. 
Specifically, we replace the random variable X — {Xi, . . . , X^)'^ by the transformed random 
variable f{X) — {fi{Xi), . . . , fd{Xd))^ , and assume that f{X) is multivariate Gaussian. This 
results in a nonparametric extension of the Normal. The nonparanormal only depends on the 
univariate functions {/j}j=i and the correlation matrix E°, all of which are to be estimated 
from data. While the resulting family of distributions is much richer than the standard 
parametric Normal (the paranormal), the independence relations among the variables are 
still encoded in the precision matrix Vt^ = (S°) ^, as we show below. 
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Definition 2.1 Nonparanormal. Let / = {fi, ■ ■ ■ , fd} be a set of monotone univariate 
functions and E° e M'^^'' be a positive-definite correlation matrix, with 

diag = 1. (2.2) 

We say a d-dimensional random variable X — {Xi, . . . , X^)'^ has a nonparanormal distribu- 
tion X ~ NPNaif, E") if f{X) := {f,{X,), . . . , ~ 7V(0, E^). 

For continuous distributions, the nonparanormal family is equivalent to the Gaussian cop- 
ula family (Klaassen and Wellner, 1997; Tsukahara, 2005). It is clear that the nonparanormal 
family is much richer than the Normal family. 

Let = (SO)"' be the precision matrix. Liu et al. (2009) have been proved that the 
sparsity pattern of fl'^ encodes the undirected graph of X; that is, 

^% = <^ Xj AL Xk \ X\{j^k}- (2.3) 

Therefore, even though the nonparanormal family is larger than the Gaussian family, its 
conditional independence graph is encoded by the sparsity pattern of 0°. 

2.2. The Normal-score based Nonparanormal Graph Estimator 

Liu et al. (2009) suggest a two-step procedure to estimate the graph. 

1. Replace the observations, for each variable, by their respective normal-scores. 

2. Apply the glasso to the transformed data to estimate the undirected graph. 

More specifically, let x^, . . . , e R'' be n data points and let /(•) be the indicator 
function. We define 

i=l 

to be the scaled empirical cumulative distribution function of Xj. Liu et al. (2009) study the 
estimator of the nonparanormal transformation functions given by^ 

f,{t) = (TsjF.it)]) , j = l,...,(i, 

where $"'(■) is the standard Gaussian quantile function and Ts^ is a Winsorization (or 
truncation) operator defined as 

Ts^{x) := Sn ■ I{X <Sn)+X- I{Sn < X < 1 - (5„) + (1 - (5^) • /(x > 1 - (5„). 

^Instead of Fj, Liu et al. (2009) use the standard empirical cumulative distribution function. These two 
estimators are asymptotically equivalent. 
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Let S'^^ — [S^^] be the correlation matrix of the transformed data, where 



1 " ^ 



STk - -^^^^^^^=- (2-5) 




The nonparanormal estimate of the inverse correlation matrix QP^ can be obtained by plug- 
ging S^^ into the glasso. 

Taking 5„ = — -j—^ , we call S^^ the normal-score rank correlation coefficient. For bivariate 

Gaussian copula distributions, Klaassen and Wellner (1997) prove that Sj^ is semiparametric 
efficient in estimating E°j.. However, their efficiency result can not be generalized to the high 
dimensional setting. The main reason is that the standard Gaussian quantile function $~^(-) 
diverges very quickly when it is evaluated at a point close to 1. To handle high dimensional 
cases, Liu et al. (2009) suggest a truncation level 

Such a truncation level 5„ is chosen to control the tradeoff of bias and variance in high 
dimensions. They analyze the high dimensional scaling of the precision matrix estimator Q'^ 
and show that 




s + d) log d + log n 



- 0^11^ = Op\\l ^ ' ^ I , (2.7) 



(2.8) 



where s :— Card ({(j, /c) G {1, . . . , (i} x {1, . . . , d} | V^^^ 7^ 0, j 7^ /c}) is the number of nonzero 
off-diagonal elements of the true precision matrix. 

Using the results of Ravikumar et al. (2009), it can also be shown that, under appropriate 
conditions, the sparsity pattern of the precision matrix can be accurately recovered with 
high probability. In particular, the nonparanormal estimator Q"^ satisfies 



p(^^ > 1-0(1) 



where ^?(Q°^ Q°) is the event jsign {^"f^ = sign {^l]^ , Vj, A; e {1, . . . , d}}. We refer to Liu 
et al. (2009) for details of the technical conditions and proofs. 

As has been discussed by Liu et al. (2009), it was not clear whether the obtained rates 
in (2.7) and (2.8) are optimal or not. In this paper, we show that these rates are indeed not 
optimal and can be greatly improved using different estimators. 
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3. The Nonparanormal skeptic 



In this section we propose a different approach for estimating that achieves a much faster 
rate of convergence, without explicitly estimating the transformation functions. 

3.1. Main Idea 

The main idea behind our alternative procedure is to exploit the Spearman's rho and 
Kendall's tau statistics to directly estimate the unknown correlation matrix, without ex- 
plicitly calculating the marginal transformation functions {fj}'^^^- 

1 " 

Let be the rank of among x], . . . , and fj = — r]. We consider the following 

1=1 

statistics: 

(bpearman s rho) pjk — 



(Kendall's tau) fjfe = ^ sign (^(x} - a;}') {x\ - 4')) . 

^ ' l<i<i'<n 

Both can be viewed as a form of nonparametric correlation between the empirical realizations 
of random variables Xj and X^. Note that these statistics are invariant under monotone 
transformations. For Gaussian random variables there is a one-to-one mapping between these 
two statistics; details can be found in Kruskal (1958). Let Xj and Xk be two independent 
copies of Xj and Xk- We denote by Fj and Fk the CDFs of Xj and Xk- The population 
versions of Spearman's rho and Kendall's tau are given by 

p,fc:=Corr(F,(X,),F,(Xfc)), (3.1) 
Tjk := Corr (sign{Xj - Xj),sign{Xk - Xk)^ . (3.2) 

Both pjk and Tjk are association measures based on the notion of concordance. We call 
two pairs of real numbers (s, t) and {u, v) concordant if {s — t){u — v) > and disconcordant 
if (s — t){u — v) < 0. The following proposition provides further insight into the relationship 
between pj^ and Tjk- The proof is provided in the appendix for completeness. 

Proposition 3.1. Let {X^^\xj^^^),{xf ,xl^^), and (xf\xf^) be three independent ran- 
dom vectors with the same distribution as {Xj,Xk). Define 

C(j, t; k, u, v) = P((xf - )(X(") - Xl'^)) > 0), 
D(j, t; k, u, v) = P((xj^) - xf){X^^^ - ) < 0) . 

Then pjk = 3C(j, 1, 2; k, 1, 3) - 3D(j, 1, 2; /c, 1, 3) and t.u = C(j, 1, 2; fc, 1, 2) - D(j, 1, 2; k, 1, 2). 
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For Gaussian copula distributions, the following important lemma connects the Spear- 
man's rho and Kendall's tau to the underlying Pearson correlation coefficient E^j^.. 

Lemma 3.1 (Fang et al. (2002); Kruskal (1958)). Assuming X ~ NPNd{f,T,^), we have 

= 2 sin (^Pjk) = sin (|T,fc) . (3.3) 

Motivated by this lemma, we define the following estimators = [Sj^.] and S'^ = [5^^] 
for the unknown correlation matrix S'^: 




As will be shown in later sections, the final graph estimators based on the Spearman's rho 
and Kendall's tau statistics have similar theoretical performance. In the following sections 
we omit the superscripts p and r and simply denote the estimated correlation matrix by S. 



3.2. The Nonparanormal skeptic with Different Graph Estimators 

The estimated correlation matrices S"^ and S'' can be directly plugged into different para- 
metric Gaussian graph estimators to obtain the final precision matrix and graph estimates. 



3.2.1. The Nonparanormal skeptic with the Graphical Dantzig Selector 

The main idea of the graphical Dantzig selector (Yuan, 2010) is to take advantage of the con- 
nection between multivariate linear regression and entries of the inverse covariance matrix. 
The detailed algorithm is listed in the following, where 5 is the tuning parameter. 

• Estimation: For j — 1, . . . ,d, calculate 

6^ = argmin ||6'||i subject to \\S\jj - S\j^\jO\\oo < S, (3.4) 



(3.5) 



Symmetrization : 



and — —Q,jj9K (3.6) 



QgDS ^ argmin ||Q - (3.7) 

Q.=QT 



Within each iteration, the Dantzig selector in (3.4) can be formulated as a linear pro- 
gram. A more sophisticated path algorithm (DASSO) to solve the Dantzig selector has been 
developed by James et al. (2009). 
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3.2.2. The Nonparanormal skeptic with the CLIME 



Let be the rf-dimensional identity matrix. The estimated correlation coefficient matrix S 
can also be plugged into the CLIME estimator (Cai et al., 2011), which is defined by 

^CLiME ^ argmin V 1^^! subject to \\Sn - Id||max < A, (3.8) 

where A is the tuning parameter. Cai et al. (2011) show that this convex optimization can 
be decomposed into d vector minimization problems, each of which can be cast as a linear 
program. Thus the CLIME has the potential to scale to large datasets. 

3.2.3. The Nonparanormal skeptic with the Graphical Lasso 

We can also plug the estimated correlation matrix S into the graphical lasso: 

^giasso _ ^j.g ^-^ j f^^^^ -\og\n\ + Xj2\^jk\ \ ■ (3.9) 

One thing to note is that S may not be positive semidefinite. Even though the formula- 
tion (3.9) is still convex, certain algorithms (like the blockwise-coordinate descent algorithm 
(Friedman et al., 2008)) may fail if S is not positive semidefinite. However other algorithms 
like two-metric projected Newton method or first-order projection do not have such positive 
semidefinite assumption on S. These algorithms can be directly exploited to efficiently solve 
(3.9). 

Unlike the graphical Lasso formulation, the graphical Dantzig selector and CLIME can 
both be formulated as linear programs, so they do not require positive semidefiniteness of 
the input correlation matrix. 

3.2. J^. The Nonparanormal skeptic with the Parallel Regression based Graph Estimator 
(The Meinshausen-Biihlmann procedure) 

The nonparanormal skeptic can also be applied with the Meinshausen-Biihlmann procedure 
to estimate the graph. As has been discussed in Friedman et al. (2008), the correlation matrix 
is also a sufficient statistic for the Meinshausen-Biihlmann procedure. However, in this case, 
we need to make sure that S is positive semidefinite. Otherwise, the algorithm may not 
converge. Practically, we can first project S into the cone of positive semidefinite matrices. 
In particular, we need to solve the following convex optimization problem: 

,5 = argmin||^-5|Uax. (3.10) 

Here we use the elementwise sup-norm || ■ ||max instead of the Frobenius norm || ■ \\f, due 
to theoretical concerns developed in the next section. In fact, it is easy to show that the 
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optimization problem in (3.10) can be formulated as the dual of a graphical lasso problem. 
To find the projection sohition. wc need to search for the smallest possible tuning parameter 
which still makes the optimization problem feasible. Empirically, we can use a surrogate 
projection procedure that computes a singular value decomposition of S and truncates all 
of the negative singular values to be zero. We find that this procedure works well. 

3.3. Computational Complexity 

Compared to the corresponding parametric methods like the graphical lasso, graphical Dantzig 
selector, CLIME, and the Meinshausen-Biihlmann estimator, the only extra cost of the non- 
paranormal SKEPTIC is the computation of S, which requires us to calculate d{d— l)/2 pair- 
wise Spearman's rho or Kendall's tau statistics. A naive implementation of the Kendall's 
tau matrix requires 0{dP'n^) computation. However, efficient algorithm based on sorting and 
balanced binary trees has been developed to calculate Kendall's tau statistic with a compu- 
tational complexity 0{(fn\ogn). Details can be found in Christensen (2005). 

If we assume that each data point is unique (no "ties" in computing ranks) , then Spear- 
man's rho statistic can be written as 

where is the rank of among x^, . . . , x'^. Therefore, once the ranks are given, the statistic 
can be computed very efficiently. Calculating has the cost 0{d?n\ogn). 

3.4' Estimating Marginal Transformations 

We estimate the marginal transformation fj using the following estimator: 

^•(x):=$-^(F,(a;)), (3.12) 

where Fj{t) is defined as 

m = itH^)<t)-l{t<t<l-i) (3.13) 

It's easy to see that for any fixed t, fj{t) converges in probability to fj{t). Theorem 3.1 
provides a stronger result that fj converges to fj uniformly over an expanding interval with 
high probability. The proof is provided in Appendix B. 
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Theorem 3.1. Let gj :— fj ^ be the inverse function of fj. For any < 7 < 7/4, we define 



9j 



7 



-7 log« ,9j 



7 



— 7 ) log n 



(3.14) 



then sup 



m-m =op{i) 



4. Theoretical Properties 

In this section we analyze the statistical properties of the nonparanormal skeptic estimator. 
Our main result shows that S'' and S'^ estimate the true correlation matrix E° at the optimal 
parametric rate in high dimensions. Such a result allows us to leverage existing analysis of 
different parametric methods (e.g. graphical lasso, graphical Dantzig, and CLIME) to analyze 
the nonparanormal skeptic estimator. 

Our main result, Theorem 4.3, provides a general statement that the nonparanormal 
SKEPTIC achieves the same graph recovery and parameter estimation performance as the 
corresponding parametric methods. Since the nonparanormal family is much richer than 
the Gaussian family, such a result suggests that the nonparanormal skeptic can be a safe 
replacement for the Gaussian graphical model. We then use the graphical Dantzig selector 
as an illustrative example to showcase this result. Similar analyses can be carried out for the 
CLIME and graphical lasso. To simplify the analysis, we assume that there are no ties in 
the ranks assigned to data points. 



4'1- Concentration Properties of the Estimated Correlation Matrices 

We first prove the concentration properties of the estimators and S'^. Let E^j^. be the 
Pearson correlation coefficient between fj{Xj) and fki^k)- In terms of the || • Umax norm, we 
show that both S'' and S'^ converge to E° in probability at the optimal parametric rate. Our 
results are based on different versions of Hoeffding's inequahties for U-statistics. Without loss 
of generality, in this paper we always assume d > n. The results ior d < n are straightforward. 
The proof of both theorems can be found in Appendix A. 

Theorem 4.1. For any < a < 1, whenever 
we have 

P(.up|s^-E»,|>— ^^j<-. (4.2) 
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Therefore, let a — 



3v^ 



. Then with probability at least 1 — 2/(f, for n > 



21 



+ 2, we have 
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logo? 



sup S^^ 



jk 




logd 



n 



(4.3) 



The next theorem illustrates the concentration property of S"^. 
Theorem 4.2. For any n> 1, with probability at least \ ~ 1/d, we have 



sup S], - ^% < 2.457r/ 

jk V 



(4.4) 



The above results lead to the following "meta-theorem" , which implies that even though 
the nonparanormal skeptic is a semiparametric approach, it achieves the optimal parametric 
rate in high dimensions. 

Theorem 4.3 (Main Result). Plugging or S'^ into the graphical lasso (or graphical 
Dantzig selector, or CLIME), under the same conditions on S° that ensure estimation or 
graph recovery consistency for these parametric methods (under Gaussian models), the non- 
paranormal SKEPTIC achieves the same (parametric) rates of convergence for precision ma- 
trix estimation or graph recovery under nonparanormal models. 

Proof. The proof is based on the key observation that the sample correlation matrix S is 
a sufficient statistic for all three methods: the graphical lasso, graphical Dantzig selector, 
and CLIME. By examining the analyses of Ravikumar et al. (2009), Yuan (2010), and Cai 
et al. (2011), a sufficient condition on S that secure both estimation consistency and graph 
recovery consistency is that there exists some constant c such that 



The graphical lasso, graphical Dantzig selector, and CLIME have been proved to be 
minimax rate optimal over certain parameter spaces under Gaussian models. Since the non- 
paranormal family is strictly larger than the Gaussian family, we immediately obtain the 
minimax optimality of the nonparanormal estimator: 

CoroIIciry 4.1. Over all the parameter spaces ofT!^ such that the graphical lasso, graphical 
Dantzig selector, or CLIME is minimax rate optimal under Gaussian models, the correspond- 
ing nonparanormal SKEPTIC estimator is also minimax rate optimal for the same parameter 
space of under nonparanormal models. 

The key message conveyed by the main theorem and the above corollary is that, in terms 
of rates of convergence, the nonparanormal SKEPTIC is a safe replacement for the Gaussian 
graphical models. The extra flexibility and robustness come at very little cost. 




(4.5) 



which can be replaced by (4.3) and (4.4) from Theorems 4.1 and 4.2. 



□ 
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Remcirk 4.1. Even though in this section we mainly present the results about the graphical 
Dantzig selector, graphical lasso, and CLIME, similar arguments as in Theorem 4.3 hold for 
almost all methods that use the correlation matrix E° as a sufficient input statistics. For 
example, if we plug the projected estimator S from (3.10) into the Meinshausen-Biihlmann 
procedure, the resulting estimator can be proven to be graph recovery consistent and achieve 
the optimal parametric rate of convergence. 



4-2. Applying the Nonparanormal skeptic with the Graphical Dantzig 
Selector 

In Theorem 4.3 we show that the nonparanormal SKEPTIC estimator S can be plugged 
into any parametric graph estimation procedure and achieves the same parametric rate of 
convergence. In this subsection we use the graphical Dantzig selector to illustrate more 
details. 

Denote by O'^p"'^ the inverse correlation matrix estimated using the nonparanormal SKEP- 
TIC with the graphical Dantzig selector in (3.7). Given a matrix Q, we define deg(Q) = 
m8iKi<i<(i'^j=i I 7^ 0). Following Yuan (2010), we consider a class of inverse correla- 
tion matrices defined by 

Mi{k,t,M) := 0,diag(f^-^) = 1, \\Q\\i < k, 

max 

(Q) < T,deg(Q) < m}, 

where k, r > 1. 

Recall that 0° = (E°) , we have the following corollary of Theorem 4.3. 

Theorem 4.4. For 1 < q < oo, there exists a constant Ci that depends on k, t, Xmini^'^), 
and Ainax(^°); such that 



sup ||1]"P°- - nX = Op I M I , (4.6) 

QPeMi{K,T,M) \ \ n 



n I log d 

provided that lim — -— = oo and 5 = Ci\ , for sufficiently large Ci. Here 5 is the 

log d \ n 



tuning parameter used in (3.4). 

Proof. The proof is directly obtained by replacing Lemma 12 in Yuan (2010) with the result 
of Theorem 4.3. □ 

The next theorem establishes the minimax lower bound for inverse correlation matrix 
estimation over the class M.i{n, r, M). 
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1 /2 

Theorem 4.5 (Yuan (2010)). Let M (logd/n) ' — o(l). Then there exists a constant C > 
depending only on k and r such that 



liminf inf sup P ||^^ - > CM\ > 0, 

where the infimum is taken over all estimates of fl based on the observed data x^, . . . ,x"'. 



5. Experimental Results 

In this section we compare different methods on both synthetic and real datasets. We consider 
the following methods: 

• npn - the original nonparanormal estimator proposed by Liu et al. (2009). 

• normal - the Gaussian graphical model (either the graphical lasso or the Meinshausen- 
Biihlmann procedure as indicated by the context). 

• npn-spearman - the nonparanormal skeptic using the Spearman's rho. 

• npn-tau - the nonparanormal skeptic using the Kendall's tau. 

• npn-ns - the normal-score based estimator in (2.5) with Sn — . 

n + 1 

5. 1 . Summary of the Experimental Results 

Let A and B be two graph estimation procedures. To compare their graph recovery perfor- 
mance, we use A >siightiy B to represent that A slightly outperforms B; A > B means A is 
better than B; A^ B means A is significantly better than B; while A^ B means A and B 
have similar performance. 

Our main conclusions are listed as below: 

• non-Gaussian data with no outlier: npn-ns ~ npn npn-spearman npn-tau ^ normal. 

• non-Gaussian data with low level of outliers: npn-tau npn-spearman > npn > npn-ns 
normal. 

• non-Gaussian data with higher level of outliers: npn-tau > npn-spearman 
3> npn > npn-ns 3> normal. 

• Gaussian data with no outlier: normal npn-ns npn >siightiy 
npn-spearman ^ npn-tau. 

• Gaussian data with low level of outhers: npn-tau npn-spearman > npn > npn-ns » 
normal. 

• Gaussian data with higher level of outliers: npn-tau > npn-spearman > npn > npn-ns > 
normal. 

The above results illustrate a tradeoff of estimation robustness and statistical efficiency. 
For nonparanormal data with no outlier, the npn-ns and npn behave similar to the npn- 
tau and npn-spearman. However, if the data are contaminated by outliers, the npn-tau and 
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npn-spearman outperform npn-ns and npn even when the contamination proportion is low. 
Overall, our simulations suggest that both the npn-tau and npn-spearman have a good balance 
of statistical efficiency and robustness. 

Besides numerical simulations, we also apply our method on a large-scale genomic dataset 
as an empirical case study. The implementations of these methods can be found in our R 
package named huge, which is freely available from GRAN. 

5.2. Numerical Simulations 

We adopt the same data generating procedure as in Liu et al. (2009). To generate a d- 
dimensional sparse graph G = {V.E), let V = {l,...,d} correspond to variables X = 
(Xi, X(i). We associate each index j G {!,...,(/} with a bivariate data point (l^'-^'', Y^*''^'') G 
[0, 1]2 where 

y}''\ . . . , yJ'^^^'UaUniformfO, 1] 
for k — 1,2. Each pair of vertices is included in the edge set E with probability 

where j/j := {yi^\yi'^^) is the empirical observation of {Y-^^\Y-^'^^) and || • || represents the 
Euclidean distance. Here, s = 0.125 is a parameter that controls the sparsity level of the 
generated graph. We restrict the maximum degree of the graph to be 4 and build the inverse 
correlation matrix according to Q^^. = 1 if j = A;, = 0.245 if (j, k) G E, and = 
otherwise. Here the value 0.245 guarantees positive definiteness of il^. Let S° = (fi*^) ^ . To 
obtain the correlation matrix, we simply rescale S° so that all its diagonal elements are 1. 
We then sample n data points x^, . . . , from the nonparanormal distribution NPNd{P, S°), 
where for simplicity we use the same univariate transformations on each dimension, i.e., 
/° = . . . = /° = /°. To sample data from the nonparanormal distribution, we also need 
g'^ :— (/°)~^. The following two different versions of are used in the simulation: 

Definition 5.1. (Gaussian GDF Transformation) Let Qq be a univariate Gaussian cumula- 
tive distribution function with mean /ig^ and the standard deviation ag^: 

9o{t) := $ ('-^) . 

The Gaussian GDF transformation — (/°)~^ for the j-th dimension is defined as 

go{z,) - [ goim (^) dt 

y / (^o(y) - / gom (^^) d^ ct> (^) dy 
where (f){-) is the Standard Gaussian density function. 
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Definition 5.2. (Power Transformation) Let go{t) :— sign(i)|i|" where a > is a parameter. 
The power transformation for the j-th dimension is defined as 



where 0(-) is the standard Gaussian density function. 

These two transformations have been proposed by Liu ct al. (2009) to study the perfor- 
mance of the original nonparanormal estimator. To comply with their simulation design, for 
the Gaussian CDF transformation we set /ig^ — 0.05 and ag^ — 0.4; for the power transfor- 
mation, we set q; = 3. 

To generate synthetic data, we set d — 100, resulting in (^2°) + 100 = 5, 050 parameters 
to be estimated. The sample sizes are varied from n = 100, 200 to 500. Three conditions are 
considered, corresponding to using the power transformation, the Gaussian CDF transfor- 
mation, and linear transformation (or no transformation)^. 

To evaluate the robustness of these methods, we consider two types of data contamination 
mechanisms: the deterministic contamination vs. random contamination. Let r G (0, 1) be 
the contamination level. For deterministic contamination we replace [nr\ data points with 
a deterministic vector (-1-5, —5, -|-5, —5, -|-5, . . .)^ G R'^, in which the numbers -|-5 and —5 
occur in an alternating way. For random contamination, we randomly (according to uniform 
distribution) select \nr\ entries of each dimension and replace them with either -|-5 or —5 
with equal probability. From the robustness point of view, the deterministic contamination 
is more malicious and can severely hurt non-robust procedures. In contrast, the random 
contamination is relatively benign and is more realistic for modern scientific data analysis. 

Both the normal-score based nonparanormal estimators (npn and npn-ns) and the non- 
paranormal SKEPTIC estimators (npn-spearman and npn-tau) are two-step procedures. In the 
first step we obtain an estimate S of the correlation matrix; in the second step we plug 
S into a parametric graph estimation procedure. In this numerical study, we consider two 
parametric baseline procedures: (i) the graphical lasso and (ii) the Meinshausen-Biihlmann 
graph estimator. The former one represents the likelihood-based approach and the latter 
one represents the pseudo-likelihood based approach. For empirical applications, we found 
that the CLIME has a similar behavior to the graphical lasso, while the graphical Dantzig 
selector behaves similar to the Meinshausen-Biihlmann method. Our implementations of the 
nonparanormal SKEPTIC, graphical lasso and Meinshausen-Biihlmann methods are available 
in the R package huge'^. 

Let G — (y, E) be a d-dimensional graph. We denote to be the number of edges in the 
graph G. We adopt false positive and false negative rates to evaluate the graph estimation 

^For linear transformation, the data exactly follow the Gaussian distribution. 

■'http://cran.r-projcct.0rg/web/packages/R.huge/. The package huge corrects some non-convergence 
problem of the glasso package. 




(5.3) 
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performance. Let — (V, E^) be an estimated graph using the regularization parameter A 
in either the graphical lasso procedure (3.9) or the Meinshausen-Biihhiiann procedure. The 
number of false positives when using the regularization parameter A is 

FP(A) :— the number of edges in but not in E. (5.4) 
The number of false negatives at A is defined as 

FN(A) := the number of edges in E but not in E^. (5.5) 
We further define the false negative rate (FNR) and false positive rate (FPR) as 

FNR(A) ^ and FPR(A) FP(A)/[Q " l^l])" (^"6) 

Table 1 

Quantitative comparison of the 5 methods on simulated datasets using different nonparanormal 
transformations. The graphs are estimated using the glasso algorithm with deterministic data 
contamination. Note: "npn" is the Winsorized normal-score nonparanormal estimator from Liu et al. 
(2009); "normal" is the naive Gaussian graph estimator; "Spearman" represents the nonparanormal 
SKEPTIC using Spearman's rho; "Kendall" represents the nonparanormal skeptic using Kendall's tau; 
"npn-ns" represents the nonparanormal skeptic using normal-score rank correlation coefficient. 









npn 




npn- 


■ns 


normal 


spearman 


ker 


idall 


tf 


r 


n 


FPR(%) 


FNR 


FPR 


FNR 


FPR 


FNR 


FPR 


FNR 


FPR 


FNR 


cdf 


0.00 


100 
200 
500 


11(2.9) 
6(2) 
2(1.6) 


13(3.5) 
5(2.1) 
1(1.2) 


11(3.1) 
6(1.9) 
3(1.7) 


13(3.6) 
6(2.5) 
1(1.1) 


26(6.9) 
18(6.7) 
11(4.2) 


38(9.2) 
32(17.2) 
19(20.9) 


11(3.4) 
6(2.2) 
3(1.6) 


15(3.6) 
6(2.4) 
2(1.4) 


11(3.2) 
6(2.1) 
3(1.6) 


15(3.6) 
6(2.4) 
2(1.4) 




0.01 


100 
200 
500 


14(3.8) 
12(3.7) 
4(1.6) 


15(3.9) 
16(4.5) 
5(2) 


16(4.4) 
24(7.8) 
7(2.4) 


15(4.5) 
13(6.7) 
8(2.7) 


33(8) 
40(9.7) 
40(9.3) 


38(11.4) 
28(15.8) 
17(14.2) 


13(3.1) 
10(2.7) 
3(1.5) 


16(3.8) 
12(3.4) 
3(1.6) 


13(3.2) 
10(2.8) 
3(1.4) 


16(3.9) 
12(3.1) 
3(1.6) 




0.05 


100 
200 
500 


27(2.6) 
36(2) 
33(1.3) 


12(3.5) 
7(2) 
1(0.9) 


26(2.4) 
37(2) 
33(1.2) 


12(3.5) 
7(2) 
1(1) 


40(10.4) 
37(13.8) 
43(10.7) 


40(13) 
35(24.4) 
21(17.4) 


25(2.3) 
36(2.4) 
31(1.4) 


14(3.3) 
8(2.5) 
1(1) 


27(2.9) 
36(2.3) 
31(1.5) 


13(3.2) 
8(2.7) 
1(1.2) 


linear 


0.00 


100 
200 
500 


11(3.2) 
6(2.1) 
2(1) 


13(3.7) 
5(2) 
1(1.1) 


11(2.9) 
5(2) 
2(1.1) 


13(3.1) 
5(2) 
1(1) 


11(2.8) 
5(1.5) 
2(0.9) 


12(3.2) 
6(4.1) 
1(0.7) 


11(2.6) 
6(2) 
2(0.9) 


14(3.5) 
6(2.1) 
1(1.2) 


11(2.8) 
6(2.1) 
2(0.9) 


16(3.6) 
6(2.3) 
1(1.2) 




0.01 


100 
200 
500 


14(3.3) 
13(4.4) 
5(2.1) 


16(4.1) 
16(4.6) 
5(2.3) 


16(4.3) 
27(5.9) 
7(2.3) 


16(4.8) 
11(5.6) 
10(3.4) 


25(3.3) 
37(4) 
33(2.9) 


13(7.6) 
6(8.2) 
2(3.6) 


13(3.5) 
10(2.7) 
3(1.2) 


16(4) 
12(3.2) 
3(1.6) 


13(3.8) 
9(2.9) 
3(1.3) 


16(4.5) 
12(3.3) 
3(1.6) 




0.05 


100 
200 
500 


26(2.4) 
37(1.9) 
33(1.4) 


12(3.2) 
7(3) 
1(1) 


27(2.6) 
37(1.9) 
33(1.3) 


12(3.3) 
7(2.9) 
1(1.1) 


35(4.9) 
37(5.6) 
36(3.3) 


17(7.5) 
7(12.1) 
6(6.8) 


26(2.4) 
36(2.4) 
31(1.4) 


13(3.4) 
8(2.8) 
1(1) 


27(2.5) 
37(2.6) 
31(1.4) 


13(3.1) 
8(2.8) 
1(1.1) 


power 


0.00 


100 
200 
500 


11(2.9) 
6(2.7) 
2(1.5) 


13(3.4) 
5(2.4) 
1(1.1) 


11(3.2) 
6(2.9) 
2(1.4) 


13(3.4) 
5(2.2) 
1(1.1) 


25(5) 
19(4.2) 
9(2.3) 


32(6.7) 
18(6.4) 
8(3) 


11(3.3) 
6(2.7) 
2(1.3) 


14(3.6) 
6(2.7) 
1(1.3) 


12(3.5) 
6(2.6) 
2(1.5) 


14(3.7) 
6(2.7) 
1(1.3) 




0.01 


100 
200 
500 


14(3.5) 
12(3.5) 
5(1.6) 


16(4.4) 
17(4.3) 
5(2) 


16(3.8) 
21(7.2) 
5(1.9) 


16(4.4) 
15(7.5) 
7(2.3) 


33(5.2) 
60(8.6) 
40(4.6) 


32(6.1) 
23(13.1) 
13(6.1) 


13(3.6) 
10(2.8) 
3(1.4) 


16(4.2) 
12(3.3) 
3(1.4) 


13(3.3) 
9(2.7) 
3(1.3) 


16(3.9) 
12(3.6) 
3(1.6) 




0.05 


100 
200 
500 


26(2.3) 
37(2.1) 
33(1.4) 


12(3.1) 
8(3.1) 
1(1.1) 


26(2.2) 
37(2.1) 
33(1.2) 


12(3.2) 
8(3.2) 
1(1.8) 


43(6.3) 
48(6.8) 
47(3.4) 


41(8.7) 
27(11.9) 
14(5.3) 


25(2.5) 
36(2.5) 
31(1.4) 


13(3.4) 
8(2.8) 
1(1.2) 


26(2.5) 
37(2.7) 
31(2.8) 


13(3.3) 
8(3.3) 
1(3.2) 



Let A be the set of all regularization parameters used to create the full path. The oracle 
regularization parameter A* is defined as 

A* := argmin;,^^ {FNR(A) + FPR(A)} . 
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r = r = 0.01 r = 0.05 

CDF transform CDF transform CDF transform 




0.0 0.2 OA 0.6 0.0 0.2 0.4 0.6 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 

FP FP FP 

Fig 1. ROC curves for the cdf, linear and power transformations (top, middle, bottom) using the 
Meinshausen-Biihlmann graph estimator, with deterministic data contamination at different levels (r=0, 
0.01, 0.05). Here n — 200 and d — 100. Note: "npn" is the original Winsorized normal-score nonparanormal 
estimator from Liu et al. (2009); "normal" is the naive Gaussian graph estimator; "Spearman" represents 
the nonparanormal SKEPTIC using Spearman's rho; "Kendall" represents the nonparanormal SKEPTIC using 
Kendall's tau; "npn-ns" represents the normal-score based nonparanormal estimator. 

The oracle score is defined to be FNR(A*) + FPR(A*). To illustrate the overall performance 
of the studied methods over the full paths, the averaged ROC curves for n = 200, d = 100 
over 100 trials are shown in Figures 1 to 4, using (FNR(A), 1 — FPR(A)) . For each figure five 
curves are presented, corresponding to npn, npn-tau, npn-spearman, npn-ns, and normal. 

Let FPR := FPR(A*) and FNR := FNR(A*), Tables 1 to 4 provide numerical comparisons 
of the three methods on datasets with the different transformations, where we repeat the 
experiments 100 times and report the average FPR and FNR values with the corresponding 
standard errors in the parenthesis. 
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Fig 2. ROC curves for the cdf, linear and power transformations (top, middle, bottom) using the glasso 
graph estimator, with deterministic data contamination at different levels (r=0, 0.01, 0.05). Here n — 200 
and d = 100. . Note: "npn" is the original Winsorized normal-score nonparanormal estimator from Liu 
et al. (2009); "normal" is the naive Gaussian graph estimator; "Spearman" represents the nonparanormal 
SKEPTIC using Spearman's rho; "Kendall" represents the nonparanormal skeptic using Kendall's tau; "npn- 
ns" represents the normal-score based nonparanormal estimator. 



To further illustrate the estimation efficiency loss of the nonparanormal skeptic (npn- 
spearman and npn-tau) compared with the normal-score based estimation methods (npn and 
npn-ns), in Figure 5 we compare these methods on a higher dimensional Gaussian dataset 
with n = 100, d = 200 with no outlier added in. In the following we provide detailed analysis 
based on these numerical simulations. 
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CDF transform 



CDF transform 



CDF transform 




0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 

FP FP FP 

Fig 3. ROC curves for the cdf, linear and power transformations (top, middle, bottom) using the glasso graph 
estimator, with random data contamination at different levels (r=0.05, 0.1, 0.2). Here n — 200 and d — 100. 
. Note: "npn" is the original Winsorized normal-score nonparanormal estimator from Liu et al. (2009); 
"normal" is the naive Gaussian graph estimator; "Spearman" represents the nonparanormal SKEPTIC using 
Spearman's rho; "Kendall" represents the nonparanormal skeptic using Kendall's tau; "npn-ns" represents 
the normal-score based nonparanormal estimator. 

5.2.1. N on- Gaussian Data with No Outliers 

From the power transformation and CDF transformation plots in Figures 1 to 4, we see 
that, when the contamination level r = 0, the performance of the nonparanormal SKEPTIC 
estimators (npn-spearman and npn-tau) and the previous normal-score based nonparanormal 
estimators (npn, and npn-ns) are comparable. In this case, all these methods significantly 
outperform the corresponding parametric methods (the graphical lasso and Meinshausen- 
Biihlmann procedure). 

From Tables 1 to 4, we could see that in terms of oracle FPR and FNR, npn-ns and npn 
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Fig 4. ROC curves for the cdf, linear and power transformations (top, middle, bottom) using the 
Meinshausen-Biihlmann graph estimator, with random data contamination at different levels (r=0.05, 0.1, 
0.2). Here n — 200 and d — 100. Note: "npn" is the Winsorized normal-score nonparanormal estimator from 
Liu et al. (2009); "normal" is the naive Gaussian graph estimator; "Spearman" represents the nonparanor- 
mal SKEPTIC using Spearman's rho; "Kendall" represents the nonparanormal SKEPTIC using Kendall's tau; 
"npn-ns" represents the nonparanormal SKEPTIC using normal-score rank correlation coefficient. 



seem slightly better than npn-spearman and npn-tau, though it is qualitatively undetectable 
based on eyeball examination of the ROC curves. 

5.2.2. N on- Gaussian Data with Low Level of Outliers 

When the outlier contamination level is low (r = 0.01 for the deterministic contamination 
and r = 0.1 for the random contamination), the performances of the nonparanormal SKEPTIC 
(npn-spearman and npn-tau) are significantly better than those of npn and npn-ns. Still, all 
these semiparametric methods significantly outperform the corresponding parametric meth- 
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no transform 




Fig 5. ROC curves for the linear transform using the glasso algorithm, with no contamination. Here d—200, 
and n=100. Note: "npn" is the Winsorized normal-score nonparanormal estimator from Liu et al. (2009); 
"normal" is the naive Gaussian graph estimator; "Spearman" represents the nonparanormal SKEPTIC using 
Spearman's rho; "Kendall" represents the nonparanormal SKEPTIC using Kendall's tau; "npn-ns" represents 
the nonparanormal SKEPTIC using normal-score rank correlation coefficient. 



ods (the graphical lasso and Meinshausen-Biihlmann procedure). Similar patterns can also 
be found based on the quantitative comparisons from Tables 1 to 4. 

5.2.3. Non-Gaussian Data with Higher Level of Outliers 

Again, based on the power transformation and CDF transformation plots in Figures 1 to 
4, we see that when the data contamination level is higher (r = 0.05 for the deterministic 
contamination and r = 0.20 for the random contamination), the performances of the non- 
paranormal SKEPTIC (npn-spearman and npn-tau) are significantly better than those of npn 
and npn-ns. For this high outlier case, npn-tau outperforms npn-spearman, suggesting that 
the Kendall's tau is more robust than the Spearman's rho statistic. The parametric methods 
(the graphical lasso and Meinshausen-Biihlmann procedure) again perform the worst. 

Unlike the previous low outlier case, the quantitative results from Tables 1 to 4 present 
interesting patterns. For deterministic contamination, we do not see significant improvement 
of the npn-spearman and npn-tau over npn and npn-ns in terms of oracle FPR and FNR. At 
the first sight this seems counter-intuitive since the corresponding ROC curves suggest that 
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Table 2 

Quantitative com,parison of the 5 methods on simulated datasets using different nonparanormal 
transformations. The graphs are estimated using the Meinshausen-Buhlmann algorithm with deterministic 
data contamination. Note: "npn" is the Winsorized normal-score nonparanormal estimator from Liu et al. 
(2009); "normal" is the naive Gaussian graph estimator; "Spearman" represents the nonparanormal 
SKEPTIC using Spearman's rho; "Kendall" represents the nonparanormal skeptic using Kendall's tau; 
"npn-ns" represents the nonparanormal skeptic using normal-score rank correlation coefficient. 
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npn-spearman and npn-tau arc globally better than npn and npn-ns. The main reason for such 
a result is that the oracle score point happens to be coincide with the intersection point of 
different ROC curves. On the other hand, for random contamination setting, we see that the 
performance of npn-spearman and npn-tau uniformly dominate that of the npn and npn-ns. 

5.2.4- Gaussian Data with No Outliers 

From the linear transformation plot in Figures 1 to 4, we see that when the outlier contam- 
ination level r = the performance of all these methods are comparable. Based on Tables 
1 to 4, we could see that in terms of oracle FPR and FNR, normal, npn-ns and npn are 
slightly better than npn-spearman and npn-tau. This result suggests that there is very tiny 
efficiency loss of the nonparanormal skeptic for truly Gaussian data (though this loss seems 
negligible). Such an efficiency loss is visualized by Figure 5 where n — 100 and d — 200. 
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Table 3 

Quantitative comparison of the 5 methods on simulated datasets using different nonparanormal 
transformations. The graphs are estimated using the glasso algorithm with random data contamination. 
Note: "npn" is the Winsorized normal-score nonparanormal estimator from, Liu et al. (2009); "normal" is 
the naive Gaussian graph estimator; "Spearman" represents the nonparanormal skeptic using Spearman's 
rho; "Kendall" represents the nonparanormal skeptic using Kendall's tau; "npn-ns" represents the 
nonparanormal skeptic using normal-score rank correlation coefficient. 









npn 




npn- 


ns 


normal 


spearman 


kei 


idall 


tf 


r 


n 


FPR(%) 


FNR 


PPR 


FNR 


FPR 


FNR 


FPR 


FNR, 


FPR, 


FNR 


cdf 


0.05 


100 
200 
500 


16(3.6) 
10(2.2) 
4(2.1) 


24(4.9) 
12(3) 
4(2.6) 


17(4.4) 
11(2.6) 
5(2.1) 


26(5.7) 
14(3.6) 
6(2.7) 


27(12.9) 
26(10.9) 
22(8.3) 


57(13,3) 
51(12.6) 
40(13.9) 


16(3,9) 
10(2.8) 
4(2.1) 


23(4.8) 
11(3.2) 
4(2.2) 


16(4.1) 
9(2.6) 
4(2) 


23(5) 
11(3.3) 
4(2.1) 




0.10 


100 
200 

500 


19(5) 
15(3.8) 

7(2.3) 


36(6.2) 
21(4.6) 

9(2.7) 


20(4.9) 
16(3.9) 

9(2,4) 


37(6.3) 
25(5.1) 

12(3.1) 


30(17.4) 
29(13.2) 

27(11.3) 


69(18) 
56(13.3) 

50(13) 


17(4.8) 
13(3.3) 

6(1.9) 


33(6.1) 
18(4.6) 

7(2,2) 


18(4.8) 
13(3.5) 

6(2.1) 


33(6.2) 
18(4.6) 
6(2.2) 




0.20 


100 
200 
500 


28(7.9) 
24(6.7) 
17(3.5) 


47(8.2) 
39(7.5) 
23(4.6) 


29(7.5) 
28(6.7) 
20(4.7) 


48(8.2) 
39(6.9) 
28(6) 


30(19.2) 
31(17.8) 
34(16.4) 


64(20.4) 
61(18.6) 
54(15.6) 


24(7.8) 
20(5.8) 
13(3.6) 


50(8.2) 
37(6.7) 
19(4.4) 


24(7.9) 
19(6.7) 
12(3.3) 


49(7.8) 
37(6) 
19(4.2) 


linear 


0.05 


100 
500 
200 


15(3.5) 
5(2.4) 
10(2.3) 


26(4.6) 
4(1.9) 
13(3.4) 


16(4.6) 
5(2.4) 
11(2.6) 


26(4.7) 

6(2) 
14(3.4) 


23(6.3) 
10(2.7) 
16(4.3) 


38(6.7) 
12(3.7) 
27(8.4) 


15(3.6) 
4(2.2) 
9(2.5) 


23(4.6) 
3(1.7) 
11(3) 


14(3.2) 
4(2.2) 
9(2.2) 


24(4.6) 
3(1.6) 
11(3.2) 




0.10 


100 
200 
500 


19(4.8) 
14(4) 
8(2.1) 


35(6) 
22(4.5) 
9(2.7) 


20(5.4) 
16(3.8) 
10(2.6) 


37(6.3) 
26(4.2) 
11(3.2) 


28(10.2) 
24(6.6) 
19(4.6) 


48(9.6) 
40(7.1) 
24(4.8) 


19(4.6) 
13(3) 
6(1.9) 


32(5.2) 
18(4.2) 
7(2.4) 


18(4.6) 
12(3.1) 
6(2.2) 


32(5.3) 
18(4.3) 
6(2.3) 




0.20 


100 
200 
500 


28(7.6) 
25(5.1) 

18(4) 


48(7.8) 
37(6.5) 

23(5.2) 


30(9) 
30(6.6) 

22(4.8) 


47(8.8) 
36(7) 

25(5.4) 


36(18) 
32(11.4) 

27(7.4) 


53(17.6) 
50(11.6) 

41(8.2) 


24(7.6) 
19(5.3) 

13(3.8) 


49(7.6) 
37(6.3) 

19(4.2) 


24(7) 
18(4.8) 

13(3.5) 


49(7.2) 
38(6.7) 

19(4.2) 


power 


0.05 


100 

200 
".OU 


l.-,(4..-,) 
10':i.2) 
4(2, 2J 


25(5.7) 

1. ■!(:!. 7) 
4(1, h) 


16(4.4) 

~><:^) 


25(5) 
14i:i.5) 

.5(i,!)J 


:i:i(13.2) 

:!il|.^.4) 
■>^(ii.:)) 


55(13.9) 

52(^,<)) 
:i!J(,s,l) 


15(4,1) 
!)(2.s ! 
4(2) 


2:i(4,K) 
,L2(:i,4! 
Oil. 7) 


10(4,3) 
4(2, ij 


22(5,1) 

ll(:i,2) 
:!(i,7) 




0.10 


100 
200 
500 


20(4.9) 
14(4.1) 

7(2.2) 


36(6.7) 
22(6.2) 

9(2.7) 


20(6) 
16(3.8) 
8(2.2) 


36(6.4) 
23(5.1) 

10(2.9) 


38(22.2) 
39(16.4) 
37(11.7) 


56(22.6) 
52(17.3) 

46(12.1) 


18(5.2) 
13(3.9) 

6(1.7) 


32(5.7) 
19(4.5) 

6(2,2) 


18(6.1) 
12(3.7) 

5(1.7) 


32(6.8) 
18(4.1) 
6(2.1) 




0.20 


100 

200 
".OIJ 


27(7.7) 
24(01 
1S(4) 


48(9.5) 
:i8(7.2) 


30(8.4) 
27(.-.,i)l 
2(H4,2) 


47(9,9) 

,'iS(7.,'i) 


42(24.8) 
41(24,4) 
41(il,,9j 


54(25.6) 

54(25) 
,5iU7,7) 


22(7,:!) 
20(4,7) 
l:i(:i,0) 


50(8,9) 
:!7(5,,5) 
10(4, :i) 


2:5(8) 
19(5,1) 
12(0, ij 


49(9,2) 
:i0(5,8) 
i!)(4,:!) 



5.2.5. Gaussian Data with Low and Higher Levels of Outliers 

Prom the linear transformation plot in Figures 1 to 4, we see that when the outlier con- 
tamination level r > 0, the performance of the parametric methods like the graphical lasso 
immediately decreases. The main reason is that these methods are based on the Pearson's 
correlation matrix, which is very sensitive to outliers. In contrast, the other semiparametric 
methods (npn-spearman, npn-tau, npn-ns, and npn) are more resistant to outliers. Among 
them, npn-tau is the most robust one and npn-spearman behaves similarly. Both methods 
outperform npn, which further outperforms npn-ns. 

In summary, these simulation results illustrate an interesting tradeoff between statistical 
efficiencies and estimation robustness. In general, both npn-spearnnan and npn-tau have very 
good overall performance. In practice, which method to use should be determined by our 
prior knowledge about the data. For example, for high-throughput genomics datasets, we 
believe that using npn-spearman and npn-tau are more beneficial than using some less robust 
methods like npn-ns. In contrast, if we believe the data is free of outlier, a normal-score based 
method like npn could be a good choice. 
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Table 4 

Quantitative comparison of the 5 methods on simulated datasets using different nonparanormal 
transformations. The graphs are estimated using the Meinshausen-Biihlmann algorithm with random data 
contamination. Note: "npn" is the Winsorized normal-score nonparanormal estimator from Liu et al. 
(2009); "normal" is the naive Gaussian graph estimator; "Spearman" represents the nonparanormal 
SKEPTIC using Spearman's rho; "Kendall" represents the nonparanormal skeptic using Kendall's tau; 
"npn-ns" represents the nonparanormal skeptic using normal-score rank correlation coefficient. 
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5.3. Gene Expression Data 

We compare different metfiods on a large genomics dataset (Rafa's Ref is needed ). In tfiis 
study, we collect in total 13,182 publicly available microarray samples from Affymetrix's 
HGU133a platform. These samples are downloaded from GEO and Array Express. Our 
dataset contains 2,717 tissue types (e.g., lung cancer, stem cell etc.). For each array sample, 
there are 22,283 probes, corresponding to 12,719 genes. To the best of our knowledge, this 
is thus far the largest microarray gene expression dataset that has been collected. 

The main purpose of this study is to estimate the conditional independence graphs over 
different genes and different tissue types. To estimate the gene graph, we treat the 13,182 
arrays as independent observations and the expression value of each gene as a random vari- 
able. To estimate the tissue graph, we average all the arrays belonging to the same tissue 
type and treat this tissue type expression as a random variable. In this setting the 12,719 
gene expressions are treated as independent observations. Though it is obvious that both the 
genes and tissue types are not independent, we simply adopt this approach as our working 
procedure. This is consistent with the current state-of-the-art genomics practice. 



26 



Two major challenges for conducting statistical analysis on large-scale integrated datasets 
are data cleaning and batch/lab effects removal. Wc conduct surrogate variable analysis (Leek 
and Storey, 2007) on this data to remove batch effects and normalize the data from different 
labs. Since the main purpose of this paper is to compare different methods on empirical 
datasets. We mainly focus on presenting the differential graphs between different methods. 
The detailed data preprocessing protocols and the scientific implications of the obtained 
results will be reported elsewhere. 

We first screen out all the genes whose marginal standard deviation is below a given 
threshold. Such a procedure provides us a list of 2,000 genes which vary the most across 
different array samples. To estimate the gene graph, we first calculate the full regularization 
path for 100 tuning parameters using the npn-spearman and automatically select the tuning 
parameter using a stability based approach named StARS (Liu et al., 2010). The delivered 
graph contains 1,557 edges. We then examine the full regularization paths of the other graph 
estimation methods and pick the graph that has closest sparsity level as this graph. 

To estimate the tissue network, we first remove all the data for tissue types which have 
less than 5 replications. So we end up studying the relationships of 2,714 tissue types. We 
only use the 2,000 filtered out genes to estimate the tissue network. After averaging the 
array samples belonging to the same tissue type, we obtain a final data matrix with size 
2, 000 X 2, 714. The remaining procedure of estimating the tissue graph is the same as that of 
estimating the gene graph. Some summary statistics of the estimated gene and tissue graphs 
are presented in Table 5. 

Table 5 

Some summary statistics of the HGUlSSa platform data networks learned at the gene and tissue levels. 
Note: GA:= normal; SP:=spearman; NS:= npn-ns. A> B means the number of edges only appear in the 
estimated graph of A, but not in that of B; A < B is on the contrary. 







Edge No. 






Edge 


difF 




Network 


dim 


spearman normal 


npn-ns 


SP > GA 


SP < GA 


SP > NS 


SP < NS 


Tissue 


2714 


2639 2379 


2478 


602 


342 


307 


146 


Gene 


2000 


1557 1550 


1411 


1235 


1228 


691 


545 



Prom Table 5 we see that the estimated tissue graph is denser than the gene graph. 
Since both graphs contain around 2,000 nodes with more than 1,500 edges, it is not very 
informative to visualize these estimated graphs. Instead we are interested in understanding 
the differential graphs. 

For example, at the gene level, the npn-sp graph contains 1,235 edges that are not in the 
normal graph. In contrast, the normal graph contains 1,228 edges that are not in the npn-sp 
graph. Since there are 1, 235/1, 557 ~ 80% edges in npn-sp that are not present in the normal 
graph, this suggests that the data are highly non-Gaussian. When we further compare the 



27 



(a) Gene network (npn-sp vs. normal) (b) Gene network (npn-sp vs. npn-ns) 




(c) Tissue network (npn-sp vs. normal) (d) Tissue network (npn-sp vs. npn-ns) 



Fig 6. Differential gene networks between different methods. For A vs. B, the red color represent the edges 
that only present in A but not in B, the black color represent the edges that only present in B but not in A. 
(These graphics are best visualized in color). 

npn-sp gene graph with the npn-ns graph, we found that there are 691/1, 557 ~ 45% edges 
that are not present in the npn-ns graph, suggesting that this data may contain high levels 
of outliers. Since this dataset is integrated from many sources, this is not surprising. 

Compared with the gene graphs, the tissue graphs present an interesting pattern. Even 
though the delivered tissue graphs are much denser than the gene graphs, there are only 
602/2, 714 f» 22% npn-sp edges that are not present in the normal graph. Also, there are only 
342/2, 639 ~ 12% edges in the normal graph that are not in the npn-sp graph. Such a result 
suggests that the data are still non-Gaussian. However, at the tissue level the data seems 
containing much stronger signal than that at the gene level (This may also be caused by 
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possible uninterpreted lab effects). The similar conclusion can be drawn when we compare 
the npn-spearman tissue graph with the npn-ns tissue graph. For better visualization, we plot 
the differential graphs in Figure 6. These plots visualize the difference between the estimated 
graphs and double confirm the above analysis. 

6. Conclusions 

Most methods for estimating high dimensional undirected graphs rely on the normality 
assumption. To weaken this overly restrictive parametric assumption, we propose the non- 
paranormal SKEPTIC, an improved estimator that obviates the need to explicitly estimate 
the marginal transformations, which greatly improves the statistical rate of convergence to 
the optimal parametric rate. Our analysis is non-asymptotic, and the obtained rate is min- 
imax optimal over certain model classes. The nonparanormal SKEPTIC can thus be used 
as a safe replacement of Gaussian estimators, even when the data are truly Gaussian. Be- 
sides theoretical analysis, extensive numerical simulations and empirical data analysis are 
also provided to illustrate the usefulness of our methods. The R package huge implementing 
the proposed procedures is available on the Comprehensive R Archive Network: http:/ /cran. 
r-project.org/. 
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Appendix A: Proofs of Main Results 



In this section we provide technical proofs. In the following we first prove Theorem 4.2 due 
to its simplicity. We then use this result to prove Theorem 4.1. 



A.l. Proof of Proposition 3.1 

Proof. The result on Tjk directly follows from the definition of Tjk- 

Here we prove the result holds for pj^. Since Fj{Xj) ~ Uniform[0, 1], we have pjk 
12E[Fj{Xj)Fk{Xk)] - 3. We can also easily show that 

E[l - - FkiXk))] = ^F,{X,)F,{X,)]. 

Moreover, we have 



Similarly, 



E[F,{X,)F,{X,)] 



E 
E 



P ( Xf^ < X^ 



(1) I 



P ( Xi^^ < XI'' I X, 



(1) I vW 



E 



{i{xf <xf\xf <xf^] 



k 

(1) 



(A.l) 



E[(1-F,.(X,-))(1-F,(X,))] 



= E 



P ( xf > X^ 



(1) I ^(1) 



P ( Xf ^ > X^ 



(1) I x^^^ 



E 



(A.2) 



Combining (A.l) and (A.2), we obtain 

E[Fj(Xj)F,(X,)] = lE[F,{X^)F,{X,)]+lE[(l-Fj{XMi-Fk(X,))] 



'(2) 



(1) 



'(3)^ 



= -C(j,l,2;/c,l,3). 



(A.3) 

(A.4) 



Therefore, we have 



Pjk = 12E[F,(A,)Ffc(Xfc)] -3 
= 3(2C(jU,2;A;,l,3)-l) 
= 3C(j,l,2;/c,l,3)-3D(j,l,2;A;,l,3). 



(A.5) 
(A.6) 



The last equality directly follows from the fact that C(j, 1, 2; k, 1, 3) = 1 — D(j, 1, 2; k, 1,3). □ 
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A. 2. Proof of Theorem 4-1 

Proof. The main difficulty of this analysis is that the Spearman's rho static is over rank 
variables which depend on all the samples. To handle this issue, we ffi^st rewrite the rho- 
statistic in a different form (see Page 318, Eq (9.21) of Hoeffding (1948)) 

g n n n 

= ;^337^EEE^ig"(4-^9(4-4) (A.7) 

i=l s=l t=l 

n + 1 n + 1 

where Tjk is the Kenadall's tau statistic and 

- 2) ^g;^Sn(x} - (4 - 4)- (A.9) 

is a 3rd-order U-statistic with bounded but asymmetric kernel. 
Let < a < 1. We have 



P I sup \p^k - Epifcl > - W ^ ) (A.IO) 
jk n M n ' 



I , 2ac /logo? , ^ , , 

< d^F I \Ujk-EUjk\ > — Y ^ I +^2(a) 



Ti(a) 



where 



Ua) = ci^F I > - a)c^'^ 1 = (A.ll) 



97r 



2 



whenever n> — oi t- 

(1 — q;)^c^ log a 

Without loss of generality, we assume n can be divided by 3. Using Hoeffding's inequality 
with asymmetric kernels (Hoeffding, 1963), 



Let 



Ti(a) = d¥||t/,,-E^,,|>^y^| (A.12) 

< 2.^exp(-^.VL^J.^^ (A.13) 
= 2exp ( 21ogrf- ^^aVlogd ) . (A. 14) 



c= . (A.15) 

a 
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Therefore, whenever n > 



1 



6 log d \1 — a 



a 



, with probabihty at least 1 — 2d ^, we have 



i_ TC'- I ^ 6\/6 /logo? 

sup \pjk-^Pjk\ < 

jk a 



n 



(A.16) 



Unlike Tjk which is an unbiased estimator of Tj^, is a biased estimator. To prove the 
desired result, we apply the following bias equation from Zimmerman et al. (2003): 



7r(n + 1) 
Equivalently, we can write 

E°fc = 2 • sin {^^Pjk + %fe) , 
nMpji; — 2 ■ arcsin (S^j^.) 



arcsin + (n — 2) arcsin (-^) 



where a 



jk 



2{n - 2) 

n > 1-2 (which implies that |ajfe| < 



. It is easy to see that \ajk\ < 



n-2 



(A.17) 



(A.18) 



. Therefore, for all 



P ( sup 

jk 



'^jk ^jk 



> t 



= d^¥ 



2 sin i^Pjkj - 2 sin {^^pjk + %fe) > ^) 



,2^ A- ^^6,3 

< dr\ \pjk - Epjfc ajk\ > -t 

\ 71 71 

Ottt, f \^ ir,^ I 3 I 6 

< d P ( \pjk - ^Pjk\ > -'^ - I 

< d'P - Ep,fc| >^t- 



= ci^P |pjfe-Ep,fc| >-t\. 



TT 



Thus we get the desired result. 



(A.19) 

(A.20) 
(A.21) 
(A.22) 
(A.23) 

□ 



A. 3. Proof of Theorem 4.2 

Proof. It is easy to see that Tjk is an unbiased estimator of Tjk'- Et^^ = Tjk- We have 



P 



^jk ^jk 



> t 



= P 



[ -Tjk J - sm \^-Tjk 



TT 



> t 



(A.24) 
(A.25) 
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< P \rjk-rjk\ > -t] . 



TT 



Since Tjk can be written in the form of U-statistic 

2 



(A.26) 



(A.27) 



l<i<i'<n 

where 

Kr{x\ x'') = sign yxj -Xj j [XI- xi 
is a kernel bounded between —1 and 1. Using Hoeff ding's inequahty for U-statistic, we get 



P(sup 



"^jk ^jk 



We then obtain (4.4). 



(A.28) 

□ 



Appendix B: Other Proofs 

In this Section, we prove the Theorem 3.1. 



B.l. Some Useful Lemmas 

Let $(•) and be the cumulative distribution function and density function of standard 
Gaussian. We start with some preliminary lemmas on the almost sure limit of the Gaussian 
maxima and the standardized empirical processes. 

Since gj = /r^ and fj{t) = {Fj{t)), we have gj{u) = Ff^ 

Lemma B.l. (Pickands (1963)) If z^, . . . ,z"' are i.i.d. standard Gaussian random variables, 
then 



(B.l) 



P 



v 



sup — \/ 2 log n sup z^ — \/ 2 log n 

. „ l<i<n 1 l<i<n 

lim mt ~ ~ o ' , = 

n-^oo loglogn/-\/21ogn 2 n->-oo loglogn/ySlogn 2 



1. 



This Lemma implies that, for any c > 0, for large enough n, the standard Gaussian 
random variables z^, . . . jZ"" satisfy 



sup z e 

l<i<n 



r- (\ \ loglogn /— , , \ loglogn 



a.s. 



In the following, we set c = - . Then for large enough n 



sup z e 

l<i<n 



/— loglogn /— loglogn 

V21ogn- , V21ogn+ 

V21ogn V21ogn 



almost urely. 
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For any 7 > 0, we define the following sub-intervals. 

hn-= 9j (^JaXogJi^ ,gj (^^//3lognj 



9j{0),9j Valogn 



and 



'3n 



9j 



logn ,9j 



7 logn 



with < a < 1 < ^ < 7/4 - 7. 
In the following, we define 



ATI log log n 

V21ogn 

,* ATI . log log n 

V 2 log n 



Lemma B.2. For all i e /i„ U U /sn, u;e /lawe 

P 



1-1 \ 

— < < 1 for large enough n = 1. 

n n / 

Proof. Prom Equation (B.l), we have for large enough n 

Taf ^\ loglogn /— loglogn 

P sup V21ogn- ,V21ogn+— == 
Vi<j<n L v21ogn v21ogn 

Using the definitions in (B.2) and (B.3), we have 



P ( sup x] e [gj «) , Qj {t*J] for large enough n ) = 1. 



1. 



l<i<n 



Therefore 



P ( sup ^ Iin U hn U hn for large enough n] — 1. 



l<i<n 



(B.2) 
(B.3) 



From the definition of Fj, only the values greater or equal to the supi<j<„x*- are truncated. 
The result then follows. □ 

The next technical lemma adapted from the Chapter 16 of Shorack and Wellner (1986) 
characterizes the almost sure limit of the standardized empirical process. 

Lemma B.3. (Almost Sure Limit of the Standardized Empirical Process) Consider a se- 
quence of sub-intervals [Ln \ Un^] with both Ln^ — gj (-\/ a log n) t 00 o-nd Un^ — gj [y/ p log n) t 
OO; then for 0<q;</3<| — 7 



lim sup 



n 



sup 

'L'riUt<uk 



21oglogn,ov,^,,o) 



^F,m-F,{t)) 



C a.s. 



where 0<C < 2^2 is a constant. 
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Proof. This result follows from a combination of Theorem 1 in Section 2 (Chapter 16 ) and 
Theorem 2 in Section 3 (Chapter 16) of Shorack and Wellner (1986). □ 

The following lemma characterizes the behavior of a random sequence using a determin- 
istic one. 

Lemma B.4. For any < a < 2, we have 

Umsup < c ataost surely, 

where C > is some constant. 

Proof. We only need to consider the case Fj > Fj. 
First, for large enough n 

<J.^7n^r + 4^^^y„-"\ (B.4) 
This is true since 



' (^){^/cx[ogn) 
V a log n 



^V«logn + 4y^^^j 



= (t){^/a logn) ■ 



alogn 8 log logn / a (logn) (log logn) 



where i?„ = 1 — o(l). 
Therefore, 



Also, 



'^(v^alogn) n 



y/a log n (27ralogn)V4' 

Thus, equation (B.4) follows from equations (B.5) and (B.6). 
Further, using the fact that 

1 _ < ^ if t > 1, 



(B.6) 
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we have 



4y'!5«^«!^v'i-4(v'^) (B.7) 

V n y log n 

, / / — ^ , /loglogn\ /log log n „s 

< 4•0(^^/^ + 4^J^j^J3^ (B.9) 

< $^y^b^ + 4^^^^j -$(7^1^), (B.IO) 



where the last step follows from the mean value theorem. 
Thus 



log n) + 4 ^/ ^ 1 - log n) (B.ll) 

'log logn 
V2" 



10 



Since $ is a monotone function, 



I $(7^1^) + 4^ ^"^|^^'' \/l-$(AAbg^) (B.12) 



/ — ; , /log logn 

< y^ + 4^-^. 

Using the fact that 

F,{9,{t)) = 

we have 



I — , /log logn 

<v/^ + 4^-^. 



From Lemma B.3, for large enough n, 
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Therefore 



log log n 



Finally, we have that 

(^-7 (f,(^,(a/^i^))) 



\ 



O (n"/2) 



n, 



This finishes the proof. 



B.2. Proof of Theorem 3.1 

Proof, of Theorem 3.1. 

Due to symmetricity, we only need to conduct analysis on a sub- interval of G In' 



P := 

n 



9j (0) , 9j 



7 



7 logn 



Recall that for any < 7 < 1, we define 
hn '■ 



9j{0),9j (^\/alognj hn 9j (^\/a\ognj ,gj (^^//3\ognj 



and 



'3n • = 



7, (\//51ogn) ,9jiJ(^-l^ log n 



with < a < 1 < /3 < 7/4 - 7. 

1 ~ 1 

By Lemma B.2, we know that on /i„ U hn U hn, — ^ Fj(t) < 1 for large enough 

n n 
almost surely. Therefore, we only need to analyze the term 



sup 

tehnUhnUhn 



(F,(t))-$-MF,(i)) 



First, we consider the term 



sup 

tehn 
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Since is continuous between mm{Fj{gj{0)), Fj{gj{0))} and 

max{Fj {gj{^Ja log n) ) , Fj {gj{'\/a log n) } and is differentiable on the corresponding open set, 
by the mean- value theorem, for some ^nti such that 



in,t e min{Fj {g^ (0) ) , Fj {g^ (0) ) }, max{Fj {gj ( v^a logn) ) , Fj {gj ( a logn) ) } 
we have 

sup F,{t)) - (Fj(t)) 

By Lemma B.4, the following inequality holds almost surely: 



— sup 



< ($-^)' (^max{F, [^g, ^y/alogn)),F, [^g, ^^alogn))} 

< C($-7(F,(^,■(^^b^))) 

C 

(j) (Vet log n) 

where C and Ci are some generic constants and is the standard Gaussian density function. 
Using the Dvoretzky- Kief er- Wolf owitz inequality, we have 



sup 

tehn 



Next, we consider the term 

sup 

Using Lemma B.3, for large enough n. 



log log n 



n 



l-a 



sup 

tehn 



F,{t)-F,it)\ = Op y^^^ ^1 - F, (^,(y^d^ 



log log n 



-a/2 



n 



V Oi log n 



= Op 

Using the same reasoning as before, we have 

sup $-^(F,(t))-$-^(F,.(i)) 
tehn ^ ' 



log log n 



a/2+1 



log log n 



n 



l+a/2-/3 
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Similarly, we have 



sup 
tehn 



log log n 

^/3/2-3/4+7 



We choose 



/3 = - — 7 and a = 1 — 7, 



all terms vanish. We get the desired result. 



□ 
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